15  Data Grouping and Aggregation

15.1 Why Grouping and Aggregation Matter in Data Science

Grouping and aggregation are foundational techniques in data analysis that transform raw data into actionable insights. Here’s why they’re essential:

Data Summarization
Grouping allows you to condense large amounts of data into concise summaries. Aggregation provides statistical summaries (e.g., mean, sum, count), making it easier to understand data trends and characteristics without needing to examine every detail.

Insights Across Categories
Grouping by categories—like regions, demographics, or time periods—lets you analyze patterns within each group. For example:

  • Aggregating sales data by region can reveal which areas are performing better.
  • Grouping customer data by age group can highlight consumer trends.
  • Comparing product performance across different market segments.

Time Series Analysis
In time-series data, grouping by time intervals (e.g., day, month, quarter) allows you to observe trends over time, which is crucial for forecasting, seasonal analysis, and trend detection.

Reducing Data Complexity
By summarizing data at a higher level, you reduce complexity, making it easier to visualize and interpret—especially in large datasets where working with raw data could be overwhelming.

In this chapter, we’ll explore grouping and aggregating using pandas. These methods will help you group and summarize data, making complex analysis comparatively easy.

Throughout this chapter, we’ll use the GDP and Life Expectancy dataset (gdp_lifeExpectancy.csv), which contains information about countries’ GDP per capita and life expectancy over time. This dataset is ideal for demonstrating how grouping can reveal patterns across countries, continents, and years.

Let’s start by importing the necessary libraries and loading the data.

# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set visualization defaults
sns.set(font_scale=1.5)
%matplotlib inline
# Load the GDP and Life Expectancy dataset
gdp_lifeExp_data = pd.read_csv('./Datasets/gdp_lifeExpectancy.csv')

# Display the first few rows to understand the structure
gdp_lifeExp_data.head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

15.2 Categorical Aggregation

Before diving into advanced grouping operations, it’s important to understand how to aggregate categorical variables. Unlike numerical aggregation (mean, sum, etc.), categorical aggregation focuses on counting frequencies and understanding the distribution of categories.

Categorical aggregation helps answer questions like:

  • How many observations belong to each category?
  • What’s the distribution across different categories?
  • How do two categorical variables relate to each other?

15.2.1 One-way Aggregation: value_counts()

The value_counts() method is the simplest way to count occurrences of each unique value in a categorical column. It returns a Series with categories as the index and their counts as values.

Example: Let’s count how many observations (country-year combinations) we have for each continent.

# Count the number of observations for each continent
continent_counts = gdp_lifeExp_data['continent'].value_counts()
continent_counts
continent
Africa      624
Asia        396
Europe      360
Americas    300
Oceania      24
Name: count, dtype: int64

Interpretation: Africa has the most observations (624), while Oceania has the fewest (24). This tells us the dataset has unequal representation across continents.

15.2.1.1 Useful value_counts() Parameters

The value_counts() method has several useful parameters:

  • normalize=True - Returns proportions instead of counts
  • sort=False - Returns results in the order they appear (not sorted by frequency)
  • dropna=False - Includes counts of missing values

Let’s see the proportion of observations in each continent:

# Get proportions instead of counts
continent_proportions = gdp_lifeExp_data['continent'].value_counts(normalize=True)
print("Proportions of observations by continent:")
print(continent_proportions)
print(f"\nAfrica represents {continent_proportions['Africa']:.1%} of all observations")
Proportions of observations by continent:
continent
Africa      0.366197
Asia        0.232394
Europe      0.211268
Americas    0.176056
Oceania     0.014085
Name: proportion, dtype: float64

Africa represents 36.6% of all observations

15.2.2 Two-way Aggregation: crosstab()

While value_counts() works for a single categorical variable, crosstab() is used to examine the relationship between two categorical variables. It creates a frequency table (also called a contingency table) showing counts for each combination of categories.

Use case: crosstab() is ideal for:

  • Understanding how categories from two variables overlap
  • Checking data balance across multiple dimensions
  • Creating frequency tables for statistical analysis

The crosstab() method is a special case of a pivot table specifically designed for computing group frequencies (or the size of each group).

15.2.2.1 Basic crosstab() Example

Let’s start with a simple one-way frequency count using crosstab(). While value_counts() is more direct for this purpose, crosstab() can do it too:

# Create a basic crosstab to count observations by continent
# The 'columns' parameter is just a label for the column header
pd.crosstab(gdp_lifeExp_data['continent'], columns='count')
col_0 count
continent
Africa 624
Americas 300
Asia 396
Europe 360
Oceania 24

15.2.2.2 Two-way Frequency Table

Now let’s create a two-way frequency table to see how observations are distributed across both continent and year. This helps us understand the temporal coverage for each continent.

Adding Margins:
Use the margins=True argument to add row and column totals. The All row and column provide the sum across each dimension.

# Create a two-way frequency table for continent and year
continent_year_table = pd.crosstab(
    gdp_lifeExp_data['continent'], 
    gdp_lifeExp_data['year'], 
    margins=True
)

# Display the table
continent_year_table
year 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 All
continent
Africa 52 52 52 52 52 52 52 52 52 52 52 52 624
Americas 25 25 25 25 25 25 25 25 25 25 25 25 300
Asia 33 33 33 33 33 33 33 33 33 33 33 33 396
Europe 30 30 30 30 30 30 30 30 30 30 30 30 360
Oceania 2 2 2 2 2 2 2 2 2 2 2 2 24
All 142 142 142 142 142 142 142 142 142 142 142 142 1704

Interpretation:

  • Each cell shows the count of country-year observations for that continent-year combination
  • The All column shows total observations per continent (same as value_counts())
  • The All row shows total observations per year
  • The bottom-right cell (1704) is the grand total of all observations

This table is useful for:

  • Checking if data is representative across all groups
  • Identifying any missing years for specific continents
  • Understanding the temporal balance of your dataset

15.2.2.3 Using crosstab() with Aggregation Functions

So far, we’ve used crosstab() to count frequencies. However, you can also use it to aggregate numerical values for each category combination by specifying:

  • values - The numerical column to aggregate
  • aggfunc - The aggregation function to apply (e.g., mean, sum, median)

This makes crosstab() a powerful tool that combines categorical grouping with numerical aggregation.

Example: Calculate the mean life expectancy for each continent-year combination.

This helps us see how life expectancy has evolved over time across different continents.

# Calculate mean life expectancy for each continent-year combination
mean_lifeExp_table = pd.crosstab(
    gdp_lifeExp_data['continent'], 
    gdp_lifeExp_data['year'], 
    values=gdp_lifeExp_data['lifeExp'], 
    aggfunc='mean'
)

# Round to 1 decimal place for better readability
mean_lifeExp_table = mean_lifeExp_table.round(1)

# Display the table
mean_lifeExp_table
year 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
continent
Africa 39.1 41.3 43.3 45.3 47.5 49.6 51.6 53.3 53.6 53.6 53.3 54.8
Americas 53.3 56.0 58.4 60.4 62.4 64.4 66.2 68.1 69.6 71.2 72.4 73.6
Asia 46.3 49.3 51.6 54.7 57.3 59.6 62.6 64.9 66.5 68.0 69.2 70.7
Europe 64.4 66.7 68.5 69.7 70.8 71.9 72.8 73.6 74.4 75.5 76.7 77.6
Oceania 69.3 70.3 71.1 71.3 71.9 72.9 74.3 75.3 76.9 78.2 79.7 80.7

Key Insights:

  • Life expectancy has generally increased over time across all continents
  • Oceania consistently has the highest life expectancy
  • Africa has the lowest life expectancy, though it’s improving
  • The gap between continents is gradually narrowing over time

While crosstab() is excellent for two-way categorical analysis, groupby() offers significantly more flexibility:

Think of it as a progression:

  • value_counts() → Count frequencies for one variable
  • crosstab() → Cross-tabulate two categorical variables with optional aggregation
  • groupby() → The most flexible and powerful grouping tool for all scenarios

In the next section, we’ll explore groupby() in depth, starting with single-column grouping and gradually building up to more complex operations.

15.3 groupby(): Grouping by a Single Column

The groupby() method allows you to split a DataFrame based on the values in one or more columns, creating groups that can be analyzed separately. Unlike crosstab(), which is optimized for creating cross-tabulation tables, groupby() gives you complete control over how to aggregate, transform, and filter your data.

15.3.1 Syntax of groupby()

DataFrame.groupby(by="column_name")

This creates a GroupBy object that represents the grouped data.

15.3.2 Example: Grouping by Continent

Let’s group the life expectancy data by continent. This will allow us to analyze statistics separately for each continent.

# Create a GroupBy object by grouping data by 'continent'
grouped = gdp_lifeExp_data.groupby('continent')

# This logically splits the data into groups based on unique values of 'continent'
# However, the data is not physically split into separate DataFrames
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014936493FE0>

15.3.3 Understanding the GroupBy Object

The groupby() method returns a GroupBy object, not a DataFrame. This object contains information about how the data is grouped, but doesn’t display the groups directly.

# Check the type of the grouped object
type(grouped)
pandas.core.groupby.generic.DataFrameGroupBy

Key Point: The GroupBy object grouped contains the information about how observations are distributed across groups. Each observation has been assigned to a specific group based on the value of continent for that observation. However, the dataset is not physically split into different DataFrames—all observations remain in the same DataFrame gdp_lifeExp_datauntil you apply an aggregation or transformation.

15.3.4 Exploring GroupBy Objects: Attributes and Methods

The GroupBy object has several useful attributes and methods that help you understand and work with your grouped data. Let’s explore the most important ones.

15.3.4.1 keys - Identifying the Grouping Column(s)

The column(s) used to group the data are called keys. You can view the grouping key(s) using the keys attribute.

# View the grouping key(s)
grouped.keys
'continent'

15.3.4.2 ngroups - Counting the Number of Groups

The ngroups attribute tells you how many unique groups exist based on the grouping key(s).

# Count the number of groups
grouped.ngroups
5

15.3.4.3 groups - Viewing Group Names and Their Members

The groups attribute returns a dictionary where:

  • Keys are the group names (unique values of the grouping column)
  • Values are the row indices (labels) of observations belonging to each group
# View the group names (keys of the dictionary)
grouped.groups.keys()
dict_keys(['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'])
# View the complete groups dictionary (group names and their row indices)
grouped.groups
{'Africa': [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, ...], 'Americas': [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 432, 433, 434, 435, ...], 'Asia': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, ...], 'Europe': [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 516, 517, 518, 519, ...], 'Oceania': [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103]}

You can also access just the row indices (values) for all groups:

# View the row indices for all groups
grouped.groups.values()
dict_values([Index([  24,   25,   26,   27,   28,   29,   30,   31,   32,   33,
       ...
       1694, 1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703],
      dtype='int64', length=624), Index([  48,   49,   50,   51,   52,   53,   54,   55,   56,   57,
       ...
       1634, 1635, 1636, 1637, 1638, 1639, 1640, 1641, 1642, 1643],
      dtype='int64', length=300), Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
       ...
       1670, 1671, 1672, 1673, 1674, 1675, 1676, 1677, 1678, 1679],
      dtype='int64', length=396), Index([  12,   13,   14,   15,   16,   17,   18,   19,   20,   21,
       ...
       1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607],
      dtype='int64', length=360), Index([  60,   61,   62,   63,   64,   65,   66,   67,   68,   69,   70,   71,
       1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103],
      dtype='int64')])

15.3.4.4 size() - Counting Observations per Group

The size() method returns the number of observations in each group. This is useful for understanding the distribution of data across groups.

# Count the number of observations in each continent group
grouped.size()
continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64

15.3.4.5 first() - Viewing the First Element of Each Group

The first() method returns the first non-missing element from each group. This can be useful for quickly inspecting what each group looks like.

# View the first non-missing observation from each continent group
grouped.first()
country year lifeExp pop gdpPercap
continent
Africa Algeria 1952 43.077 9279525 2449.008185
Americas Argentina 1952 62.485 17876956 5911.315053
Asia Afghanistan 1952 28.801 8425333 779.445314
Europe Albania 1952 55.230 1282697 1601.056136
Oceania Australia 1952 69.120 8691212 10039.595640

15.3.4.6 get_group() - Extracting Data for a Specific Group

The get_group() method returns all observations belonging to a particular group. This is useful when you want to analyze or visualize a specific group in detail.

# Extract all observations for the 'Asia' continent
grouped.get_group('Asia').head(10)
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
5 Afghanistan Asia 1977 38.438 14880372 786.113360
6 Afghanistan Asia 1982 39.854 12881816 978.011439
7 Afghanistan Asia 1987 40.822 13867957 852.395945
8 Afghanistan Asia 1992 41.674 16317921 649.341395
9 Afghanistan Asia 1997 41.763 22227415 635.341351

15.4 Data Aggregation Within Groups

Once you’ve created a GroupBy object, the real power comes from applying aggregation functions to summarize the data within each group. Aggregation functions compute summary statistics that help you understand patterns and trends across different groups.

15.4.1 Common Aggregation Functions

Below are the most commonly used aggregation functions when working with grouped data in pandas:

Function Description Example Use Case
mean() Calculates the average value Average life expectancy per continent
sum() Computes the total by summing all values Total population across countries
min() Finds the minimum value Lowest GDP per capita in each region
max() Finds the maximum value Highest life expectancy in each year
count() Counts the number of non-null entries Number of countries per continent
median() Finds the middle value when sorted Median income across categories
std() Calculates the standard deviation Variability in GDP within continents

Each of these functions helps summarize different aspects of your grouped data, making it easier to identify trends, outliers, and patterns.

15.4.2 Example: Calculating Mean Statistics by Country

Let’s find the mean life expectancy, population, and GDP per capita for each country during the period 1952-2007. This will give us a single summary statistic for each country across all years in the dataset.

Approach:

  1. Remove columns we don’t want to aggregate (continent and year)
  2. Group by country
  3. Calculate the mean for all remaining numeric columns
# Group the data by 'country' and calculate mean statistics
# First, drop columns we don't want to aggregate
grouped_country = gdp_lifeExp_data.drop(['continent', 'year'], axis=1).groupby('country')

# Calculate the mean for all numeric columns (lifeExp, pop, gdpPercap)
country_means = grouped_country.mean()

# Display the results
country_means.head(10)
lifeExp pop gdpPercap
country
Afghanistan 37.478833 1.582372e+07 802.674598
Albania 68.432917 2.580249e+06 3255.366633
Algeria 59.030167 1.987541e+07 4426.025973
Angola 37.883500 7.309390e+06 3607.100529
Argentina 69.060417 2.860224e+07 8955.553783
Australia 74.662917 1.464931e+07 19980.595634
Austria 73.103250 7.583298e+06 20411.916279
Bahrain 65.605667 3.739132e+05 18077.663945
Bangladesh 49.834083 9.075540e+07 817.558818
Belgium 73.641750 9.725119e+06 19900.758072

15.4.3 Calculating Other Statistics

We can apply different aggregation functions to understand the variability and spread of data within each group. Let’s calculate the standard deviation to see how much variation exists within each country over time.

# Calculate the standard deviation for each country
country_std = grouped_country.std()

# Display the results
country_std.head(10)
lifeExp pop gdpPercap
country
Afghanistan 5.098646 7.114583e+06 108.202929
Albania 6.322911 8.285855e+05 1192.351513
Algeria 10.340069 8.613355e+06 1310.337656
Angola 4.005276 2.672281e+06 1165.900251
Argentina 4.186470 7.546609e+06 1862.583151
Australia 4.147774 3.915203e+06 7815.405220
Austria 4.379838 4.376600e+05 9655.281488
Bahrain 8.571871 2.108933e+05 5415.413364
Bangladesh 9.028335 3.471166e+07 235.079648
Belgium 3.779658 5.206359e+05 8391.186269

Interpretation: Countries with higher standard deviations in lifeExp experienced greater changes in life expectancy over the years, while those with lower standard deviations remained more stable.

You can apply any of the aggregation functions mentioned earlier (min(), max(), count(), median(), etc.) in the same way to compute different statistics for your groups.

15.5 Multiple and Custom Aggregations Using agg()

While applying single aggregation functions like mean() or std() is useful, you’ll often need to:

  1. Apply multiple aggregation functions simultaneously
  2. Use custom aggregation functions that aren’t built into pandas

The agg() method of a GroupBy object provides this flexibility.

15.5.1 Multiple Aggregations on a Single Column

When you need to compute several statistics for the same column, pass a list of function names to agg().

Example: Calculate both the mean and standard deviation of GDP per capita for each country.

# Apply multiple aggregation functions to 'gdpPercap' column
gdp_stats = grouped_country['gdpPercap'].agg(['mean', 'std'])

# Sort by mean GDP per capita (descending) and show top 10
gdp_stats.sort_values(by='mean', ascending=False).head(10)
mean std
country
Kuwait 65332.910472 33882.139536
Switzerland 27074.334405 6886.463308
Norway 26747.306554 13421.947245
United States 26261.151347 9695.058103
Canada 22410.746340 8210.112789
Netherlands 21748.852208 8918.866411
Denmark 21671.824888 8305.077866
Germany 20556.684433 8076.261913
Iceland 20531.422272 9373.245893
Austria 20411.916279 9655.281488

15.5.2 Custom Aggregation Functions

In addition to built-in functions, you can create your own custom aggregation functions. This is useful when you need a statistic that isn’t available in pandas.

Example: Calculate the range (max - min) of GDP per capita for each country using a lambda function.

# Calculate the range (max - min) of gdpPercap for each country using a lambda function
gdp_range = grouped_country['gdpPercap'].agg(lambda x: x.max() - x.min())

# Display the results sorted by range (descending)
gdp_range.sort_values(ascending=False).head(10)
country
Kuwait              85404.702920
Singapore           44828.041413
Norway              39261.768450
Hong Kong, China    36670.557461
Ireland             35465.716022
Austria             29989.416208
United States       28961.171010
Iceland             28913.100762
Japan               28439.111713
Netherlands         27856.361462
Name: gdpPercap, dtype: float64
# Alternatively, define a named function for better readability
def range_func(x):
    """Calculate the range (max - min) of a series"""
    return x.max() - x.min()
# Combine built-in functions with custom functions
gdp_complete_stats = grouped_country['gdpPercap'].agg(['mean', 'std', range_func])

# Sort by range and display top 10 countries
gdp_complete_stats.sort_values(by='range_func', ascending=False).head(10)
mean std range_func
country
Kuwait 65332.910472 33882.139536 85404.702920
Singapore 17425.382267 14926.147774 44828.041413
Norway 26747.306554 13421.947245 39261.768450
Hong Kong, China 16228.700865 12207.329731 36670.557461
Ireland 15758.606238 11573.311022 35465.716022
Austria 20411.916279 9655.281488 29989.416208
United States 26261.151347 9695.058103 28961.171010
Iceland 20531.422272 9373.245893 28913.100762
Japan 17750.869984 10131.612545 28439.111713
Netherlands 21748.852208 8918.866411 27856.361462

15.5.3 Renaming Aggregated Columns

For better readability and professional reporting, you can rename the columns resulting from aggregation. Pass a list of tuples to agg(), where each tuple contains:

  1. The new column name (string)
  2. The aggregation function (string or callable)

Example: Create a summary table with custom column names and include a 90th percentile calculation.

# Apply multiple aggregations with custom column names
gdp_summary = grouped_country['gdpPercap'].agg(
    [('Average', 'mean'),
     ('Standard Deviation', 'std'),
    ('90th Percentile', lambda x: x.quantile(0.9))
    ])

# Sort by 90th percentile and display top 10
gdp_summary.sort_values(by='90th Percentile', ascending=False).head(10)
Average Standard Deviation 90th Percentile
country
Kuwait 65332.910472 33882.139536 109251.315590
Norway 26747.306554 13421.947245 44343.894158
United States 26261.151347 9695.058103 38764.132898
Singapore 17425.382267 14926.147774 35772.742520
Switzerland 27074.334405 6886.463308 34246.394240
Netherlands 21748.852208 8918.866411 33376.895065
Ireland 15758.606238 11573.311022 33121.539164
Canada 22410.746340 8210.112789 32891.561152
Saudi Arabia 20261.743635 8754.387440 32808.019502
Austria 20411.916279 9655.281488 32085.438987

15.6 Multiple Aggregations on Multiple Columns

Often, you’ll want to apply the same aggregation functions to several columns simultaneously. This is straightforward with agg().

Example: Calculate the mean and standard deviation of both life expectancy (lifeExp) and population (pop) for each country.

# Apply the same aggregations to multiple columns
multi_column_stats = grouped_country[['lifeExp', 'pop']].agg(['mean', 'std'])

# Sort by mean life expectancy (descending)
multi_column_stats.sort_values(by=('lifeExp', 'mean'), ascending=False).head(10)
lifeExp pop
mean std mean std
country
Iceland 76.511417 3.026593 2.269781e+05 4.854168e+04
Sweden 76.177000 3.003990 8.220029e+06 6.365660e+05
Norway 75.843000 2.423994 4.031441e+06 4.107955e+05
Netherlands 75.648500 2.486363 1.378680e+07 2.005631e+06
Switzerland 75.565083 4.011572 6.384293e+06 8.582009e+05
Canada 74.902750 3.952871 2.446297e+07 5.940793e+06
Japan 74.826917 6.494629 1.117588e+08 1.488988e+07
Australia 74.662917 4.147774 1.464931e+07 3.915203e+06
Denmark 74.370167 2.220111 4.994187e+06 3.520599e+05
France 74.348917 4.304761 5.295256e+07 6.086809e+06

15.7 Different Aggregations for Different Columns

Sometimes you need to apply different sets of functions to different columns. For this, pass a dictionary to agg() where:

  • Keys are column names
  • Values are lists of aggregation functions (as strings or callables)

This approach gives you complete control over which statistics are computed for each column.

# Apply different aggregation functions to different columns using a dictionary
varied_stats = grouped_country.agg({
    'gdpPercap': ['mean', 'std'],      # Mean and std for GDP per capita
    'lifeExp': ['median', 'std'],      # Median and std for life expectancy
    'pop': ['max', 'min']              # Max and min for population
})

# Display the first 10 rows
varied_stats.head(10)
gdpPercap lifeExp pop
mean std median std max min
country
Afghanistan 802.674598 108.202929 39.1460 5.098646 31889923 8425333
Albania 3255.366633 1192.351513 69.6750 6.322911 3600523 1282697
Algeria 4426.025973 1310.337656 59.6910 10.340069 33333216 9279525
Angola 3607.100529 1165.900251 39.6945 4.005276 12420476 4232095
Argentina 8955.553783 1862.583151 69.2115 4.186470 40301927 17876956
Australia 19980.595634 7815.405220 74.1150 4.147774 20434176 8691212
Austria 20411.916279 9655.281488 72.6750 4.379838 8199783 6927772
Bahrain 18077.663945 5415.413364 67.3225 8.571871 708573 120447
Bangladesh 817.558818 235.079648 48.4660 9.028335 150448339 46886859
Belgium 19900.758072 8391.186269 73.3650 3.779658 10392226 8730405

15.7.1 Combining Dictionary Syntax with Custom Column Names

You can also use the dictionary approach with custom column names by passing tuples as the aggregation values.

Example: For each country, calculate:

  • Mean and standard deviation of life expectancy (with custom names)
  • Min and max values of GDP per capita
# Apply different functions to different columns with custom names
custom_varied_stats = grouped_country.agg({
    'lifeExp': [('Average', 'mean'), 
    ('Standard deviation', 'std')],
    'gdpPercap': ['min', 'max']
})

# Display the first 10 rows
custom_varied_stats.head(10)
lifeExp gdpPercap
Average Standard deviation min max
country
Afghanistan 37.478833 5.098646 635.341351 978.011439
Albania 68.432917 6.322911 1601.056136 5937.029526
Algeria 59.030167 10.340069 2449.008185 6223.367465
Angola 37.883500 4.005276 2277.140884 5522.776375
Argentina 69.060417 4.186470 5911.315053 12779.379640
Australia 74.662917 4.147774 10039.595640 34435.367440
Austria 73.103250 4.379838 6137.076492 36126.492700
Bahrain 65.605667 8.571871 9867.084765 29796.048340
Bangladesh 49.834083 9.028335 630.233627 1391.253792
Belgium 73.641750 3.779658 8343.105127 33692.605080

15.8 Grouping by Multiple Columns

Above, we demonstrated grouping by a single column, which is useful for summarizing data based on one categorical variable. However, in many cases, we need to group by multiple columns. Grouping by multiple columns allows us to create more detailed summaries by accounting for multiple categorical variables. This approach enables us to analyze data at a finer granularity, revealing insights that might be missed with single-column grouping alone.

15.8.1 Basic Syntax for Grouping by Multiple Columns

Use groupby() with a list of column names to group data by multiple columns.

  • DataFrame.groupby(by=["col1", "col2"])

Consider the life expectancy dataset, we can group by both country and continent to analyze gdpPercap, lifeExp, and pop trends for each country within each continent, providing a more comprehensive view of the data.

#Grouping by multiple columns
grouped_continent_contry = gdp_lifeExp_data.groupby(['continent', 'country'])[ "lifeExp"].agg(['mean', 'std', 'max', 'min']).sort_values(by = 'mean', ascending = False)
grouped_continent_contry
mean std max min
continent country
Europe Iceland 76.511417 3.026593 81.757 72.490
Sweden 76.177000 3.003990 80.884 71.860
Norway 75.843000 2.423994 80.196 72.670
Netherlands 75.648500 2.486363 79.762 72.130
Switzerland 75.565083 4.011572 81.701 69.620
... ... ... ... ... ...
Africa Mozambique 40.379500 4.599184 46.344 31.286
Guinea-Bissau 39.210250 4.937369 46.388 32.500
Angola 37.883500 4.005276 42.731 30.015
Asia Afghanistan 37.478833 5.098646 43.828 28.801
Africa Sierra Leone 36.769167 3.937828 42.568 30.331

142 rows × 4 columns

15.8.2 Understanding Hierarchical (Multi-Level) Indexing

  • Grouping by multiple columns creates a hierarchical index (also called a multi-level index).
  • This index allows each level (e.g., continent, country) to act as an independent category that can be accessed individually.

In the above output, continent and country form a two-level hierarchical index, allowing us to drill down from continent-level to country-level summaries.

grouped_continent_contry.index.nlevels
2
# get the first level of the index
grouped_continent_contry.index.levels[0]
Index(['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'], dtype='object', name='continent')
# get the second level of the index
grouped_continent_contry.index.levels[1]
Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Australia',
       'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       ...
       'Uganda', 'United Kingdom', 'United States', 'Uruguay', 'Venezuela',
       'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=142)

15.8.3 Subsetting Data in a Hierarchical Index

grouped_continent_country is still a DataFrame with hierarchical indexing. You can use .loc[] for subsetting, just as you would with a single-level index.

# get the observations for the 'Americas' continent
grouped_continent_contry.loc['Americas'].head()
mean std max min
country
Canada 74.902750 3.952871 80.653 68.750
United States 73.478500 3.343781 78.242 68.440
Puerto Rico 72.739333 3.984267 78.746 64.280
Cuba 71.045083 6.022798 78.273 59.421
Uruguay 70.781583 3.342937 76.384 66.071
# get the mean life expectancy for the 'Americas' continent
grouped_continent_contry.loc['Americas']['mean'].head()
country
Canada           74.902750
United States    73.478500
Puerto Rico      72.739333
Cuba             71.045083
Uruguay          70.781583
Name: mean, dtype: float64
# another way to get the mean life expectancy for the 'Americas' continent
grouped_continent_contry.loc['Americas', 'mean'].head()
country
Canada           74.902750
United States    73.478500
Puerto Rico      72.739333
Cuba             71.045083
Uruguay          70.781583
Name: mean, dtype: float64

You can use a tuple to access data for specific levels in a multi-level index.

# get the observations for the 'United States' country
grouped_continent_contry.loc[( 'Americas', 'United States')]
mean    73.478500
std      3.343781
max     78.242000
min     68.440000
Name: (Americas, United States), dtype: float64
grouped_continent_contry.loc[( 'Americas', 'United States'), ['mean', 'std']]
mean    73.478500
std      3.343781
Name: (Americas, United States), dtype: float64
gdp_lifeExp_data.columns
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

Finally, you can use reset_index() to convert the hierarchical index into a regular index, allowing you to apply the standard subsetting and filtering methods covered in previous chapters

grouped_continent_contry.reset_index().head()
continent country mean std max min
0 Europe Iceland 76.511417 3.026593 81.757 72.49
1 Europe Sweden 76.177000 3.003990 80.884 71.86
2 Europe Norway 75.843000 2.423994 80.196 72.67
3 Europe Netherlands 75.648500 2.486363 79.762 72.13
4 Europe Switzerland 75.565083 4.011572 81.701 69.62

15.8.4 Grouping by multiple columns and aggregating multiple variables

#Grouping by multiple columns
grouped_continent_contry_multi = gdp_lifeExp_data.groupby(['continent', 'country','year'])[ ['lifeExp', 'pop', 'gdpPercap']].agg(['mean', 'max', 'min'])
grouped_continent_contry_multi
lifeExp pop gdpPercap
mean max min mean max min mean max min
continent country year
Africa Algeria 1952 43.077 43.077 43.077 9279525.0 9279525 9279525 2449.008185 2449.008185 2449.008185
1957 45.685 45.685 45.685 10270856.0 10270856 10270856 3013.976023 3013.976023 3013.976023
1962 48.303 48.303 48.303 11000948.0 11000948 11000948 2550.816880 2550.816880 2550.816880
1967 51.407 51.407 51.407 12760499.0 12760499 12760499 3246.991771 3246.991771 3246.991771
1972 54.518 54.518 54.518 14760787.0 14760787 14760787 4182.663766 4182.663766 4182.663766
... ... ... ... ... ... ... ... ... ... ... ...
Oceania New Zealand 1987 74.320 74.320 74.320 3317166.0 3317166 3317166 19007.191290 19007.191290 19007.191290
1992 76.330 76.330 76.330 3437674.0 3437674 3437674 18363.324940 18363.324940 18363.324940
1997 77.550 77.550 77.550 3676187.0 3676187 3676187 21050.413770 21050.413770 21050.413770
2002 79.110 79.110 79.110 3908037.0 3908037 3908037 23189.801350 23189.801350 23189.801350
2007 80.204 80.204 80.204 4115771.0 4115771 4115771 25185.009110 25185.009110 25185.009110

1704 rows × 9 columns

Breaking Down Grouping and Aggregation

  • Grouping by Multiple Columns:
    In this example, we are grouping the data by three columns: continent, country, and year. This creates groups based on unique combinations of these columns.

  • Aggregating Multiple Variables:
    We apply multiple aggregation functions (mean, std, max, and min) to multiple variables (lifeExp, pop, and gdpPercap).

This type of operation is commonly referred to as multi-column grouping with multiple aggregations in pandas. It’s a powerful approach because it allows us to obtain a detailed statistical summary for each combination of grouping columns across several variables.

# its columns are also two levels deep
grouped_continent_contry_multi.columns.nlevels
2
# pass a tuple to the loc() method to access the values of the multi-level columns with a multi-level index
grouped_continent_contry_multi.loc[('Americas','United States'), ('lifeExp', 'mean')]
year
1952    68.440
1957    69.490
1962    70.210
1967    70.760
1972    71.340
1977    73.380
1982    74.650
1987    75.020
1992    76.090
1997    76.810
2002    77.310
2007    78.242
Name: (lifeExp, mean), dtype: float64

15.9 Advanced Operations within groups: apply(), transform(), and filter()

15.9.1 Using apply() on groups

The apply() function applies a custom function to each group, allowing for flexible operations. The function can return either a scalar, Series, or DataFrame.

Example: Consider the life expectancy dataset, find the top 3 life expectancy values for each continent

We’ll first define a function that sorts a dataset by decreasing life expectancy and returns the top 3 rows. Then, we’ll apply this function on each group using the apply() method of the GroupBy object.

# Define a function to get the top 3 rows based on life expectancy for each group
def top_3_life_expectancy(group):
    return group.nlargest(3, 'lifeExp')
#Defining the groups in the data
grouped_gdpcapital_data = gdp_lifeExp_data.groupby('continent')

Now we’ll use the apply() method to apply the top_3_life_expectancy() function on each group of the object grouped_gdpcapital_data.

# Apply the function to each continent group
top_life_expectancy = gdp_lifeExp_data.groupby('continent')[['continent', 'country', 'year', 'lifeExp', 'gdpPercap']].apply(top_3_life_expectancy).reset_index(drop=True)

# Display the result
top_life_expectancy.head()
continent country year lifeExp gdpPercap
0 Africa Reunion 2007 76.442 7670.122558
1 Africa Reunion 2002 75.744 6316.165200
2 Africa Reunion 1997 74.772 6071.941411
3 Americas Canada 2007 80.653 36319.235010
4 Americas Canada 2002 79.770 33328.965070

The top_3_life_expectancy() function is applied to each group, and the results are concatenated internally with the concat() function. The output therefore has a hierarchical index whose outer level indices are the group keys.

We can also use a lambda function instead of separately defining the function top_3_life_expectancy():


# Use a lambda function to get the top 3 life expectancy values for each continent
top_life_expectancy = (
    gdp_lifeExp_data
    .groupby('continent')[['continent', 'country', 'year', 'lifeExp', 'gdpPercap']]  # Avoid adding group labels in the index
    .apply(lambda x: x.nlargest(3, 'lifeExp'))
    .reset_index(drop=True)
)

# Display the result
top_life_expectancy.head()
continent country year lifeExp gdpPercap
0 Africa Reunion 2007 76.442 7670.122558
1 Africa Reunion 2002 75.744 6316.165200
2 Africa Reunion 1997 74.772 6071.941411
3 Americas Canada 2007 80.653 36319.235010
4 Americas Canada 2002 79.770 33328.965070

15.9.2 Using transform() on Groups

The transform() function applies a function to each group and returns a Series aligned with the original DataFrame’s index. This makes it suitable for adding or modifying columns based on group-level calculations.

Recall that in the data cleaning and preparation chapter, we imputed missing values based on correlated variables in the dataset.

In the example provided, some countries had missing values for GDP per capita. To handle this, we imputed the missing GDP per capita for each country using the average GDP per capita of its corresponding continent.

Now, we’ll explore an alternative approach using groupby() and transform() to perform this imputation.

Let us read the datasets and the function that makes a visualization to compare the imputed values with the actual values.

#Importing data with missing values
gdp_missing_data = pd.read_csv('./Datasets/GDP_missing_data.csv')

#Importing data with all values
gdp_complete_data = pd.read_csv('./Datasets/GDP_complete_data.csv')
# Index of rows with missing values for GDP per capita
null_ind_gdpPC = gdp_missing_data.index[gdp_missing_data.gdpPerCapita.isnull()]

# Define a function to plot and evaluate imputed values vs actual values 
def plot_actual_vs_predicted(y, title_suffix=""):
    """
    Plot imputed vs actual GDP per capita values and display RMSE.
    
    Parameters:
    - y: DataFrame with imputed gdpPerCapita values
    - title_suffix: Optional string to add to the plot (e.g., method name)
    """
    fig, ax = plt.subplots(figsize=(8, 6))
    
    # Extract actual and imputed values
    x = gdp_complete_data.loc[null_ind_gdpPC, 'gdpPerCapita']
    y_imputed = y.loc[null_ind_gdpPC, 'gdpPerCapita']
    
    # Create scatter plot
    ax.scatter(x, y_imputed, alpha=0.6, s=50)
    
    # Add perfect prediction line (45-degree line)
    ax.plot(x, x, color='orange', linewidth=2, label='Perfect imputation')
    
    # Labels and formatting
    ax.set_xlabel('Actual GDP per capita', fontsize=14)
    ax.set_ylabel('Imputed GDP per capita', fontsize=14)
    ax.xaxis.set_major_formatter('${x:,.0f}')
    ax.yaxis.set_major_formatter('${x:,.0f}')
    ax.tick_params(labelsize=12)
    ax.grid(True, alpha=0.3)
    ax.legend(fontsize=11)
    
    # Calculate and display RMSE
    rmse = np.sqrt(np.mean((y_imputed - x)**2))
    
    # Position text dynamically based on data range
    x_pos = x.min() + (x.max() - x.min()) * 0.05
    y_pos = y_imputed.max() - (y_imputed.max() - y_imputed.min()) * 0.1
    ax.text(x_pos, y_pos, f'RMSE = ${rmse:,.2f}', 
            fontsize=13, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.tight_layout()
    plt.show()
    
    return rmse

Approach 1: Using the `.loc``

# Finding the mean GDP per capita of the continent
avg_gdpPerCapita = gdp_missing_data['gdpPerCapita'].groupby(gdp_missing_data['continent']).mean()

# Creating a copy of missing data to impute missing values
gdp_imputed_data = gdp_missing_data.copy()

# Replacing missing GDP per capita with the mean GDP per capita for the corresponding continent
for cont in avg_gdpPerCapita.index:
    gdp_imputed_data.loc[(gdp_imputed_data.continent==cont) & (gdp_imputed_data.gdpPerCapita.isnull()),
                     'gdpPerCapita'] = avg_gdpPerCapita[cont]

# Plot the actual vs predicted values
plot_actual_vs_predicted(gdp_imputed_data)

25473.20645170116

Approach 2: Using the groupby() and transform() methods.

The transform() function is a powerful tool for filling missing values in grouped data. It allows us to apply a function across each group and align the result back to the original DataFrame, making it perfect for filling missing values based on group statistics.

In this example, we use transform() to impute missing values in the gdpPerCapita column by filling them with the mean gdpPerCapita of each continent:

# Creating a copy of missing data to impute missing values
gdp_imputed_data = gdp_missing_data.copy()

# Grouping data by continent
grouped = gdp_missing_data.groupby('continent')

# Imputing missing values with the mean GDP per capita of the continent
gdp_imputed_data['gdpPerCapita'] = grouped['gdpPerCapita'].transform(lambda x: x.fillna(x.mean()))

# Plot the actual vs predicted values
plot_actual_vs_predicted(gdp_imputed_data)

25473.20645170116

Using the transform() function, missing values in gdpPerCapita for each group are filled with the group’s mean gdpPerCapita. This approach is not only more convenient to write but also faster compared to using for loops. While a for loop imputes missing values one group at a time, transform() performs built-in operations (like mean, sum, etc.) in a way that is optimized internally, making it more efficient.

Let’s use apply() instead of transform() with groupby()

Please copy the code below and run it in your notebook:

#Creating a copy of missing data to impute missing values
gdp_imputed_data = gdp_missing_data.copy()

#Grouping data by continent
grouped = gdp_missing_data.groupby('continent')

#Applying the lambda function on the 'gdpPerCapita' column of the groups
gdp_imputed_data['gdpPerCapita'] = grouped['gdpPerCapita'].apply(lambda x: x.fillna(x.mean()))

plot_actual_vs_predicted()

Why we ran into this error? and apply() doesn’t work?

Here’s a deeper look at why apply() doesn’t work as expected here:

15.9.2.1 Behavior of groupby().apply() vs. groupby().transform()

  • groupby().apply(): This method applies a function to each group and returns the result with a hierarchical (multi-level) index by default. This hierarchical index can make it difficult to align the result back to a single column in the original DataFrame.

  • groupby().transform(): In contrast, transform() is specifically designed to apply a function to each group and return a Series that is aligned with the original DataFrame’s index. This alignment makes it directly compatible for assignment to a new or existing column in the original DataFrame.

15.9.2.2 Why transform() Works for Imputation

When using transform() to fill missing values, it applies the function (e.g., fillna(x.mean())) based on each group’s statistics, such as the mean, while keeping the result aligned with the original DataFrame’s index. This allows for smooth assignment to a column in the DataFrame without any index mismatch issues.

Additionally, transform() applies the function to each element in a group independently and returns a result that has the same shape as the original data, making it ideal for adding or modifying columns.

15.9.3 Using filter() on Groups

The filter() function filters entire groups based on a condition. It evaluates each group and keeps only those that meet the specified criteria.

Example: Keep only the countries where the mean life expectancy is greater than 70

# keep only the continent where the mean life expectancy is greater than 74
gdp_lifeExp_data.groupby('continent').filter(lambda x: x['lifeExp'].mean() > 74)['continent'].unique()
array(['Oceania'], dtype=object)
# keep only the country where the mean life expectancy is greater than 74
gdp_lifeExp_data.groupby('country').filter(lambda x: x['lifeExp'].mean() > 74)['country'].unique()
array(['Australia', 'Canada', 'Denmark', 'France', 'Iceland', 'Italy',
       'Japan', 'Netherlands', 'Norway', 'Spain', 'Sweden', 'Switzerland'],
      dtype=object)

Using .nunique() get the number of countries that satisfy this condition

gdp_lifeExp_data.groupby('country').filter(lambda x: x['lifeExp'].mean() > 74)['country'].nunique()
12

15.10 Sampling data by group

If a dataset contains a large number of observations, operating on it can be computationally expensive. Instead, working on a sample of entire observations is a more efficient alterative. The groupby() method combined with apply() can be used for stratified random sampling from a large dataset.

Before taking the random sample, let us find the number of countries in each continent.

gdp_lifeExp_data.continent.value_counts()
continent
Africa      624
Asia        396
Europe      360
Americas    300
Oceania      24
Name: count, dtype: int64

Let us take a random sample of 650 observations from the entire dataset.

sample_lifeExp_data = gdp_lifeExp_data.sample(650)

Now, let us see the number of countries of each continent in our sample.

sample_lifeExp_data.continent.value_counts()
continent
Africa      236
Asia        154
Europe      136
Americas    113
Oceania      11
Name: count, dtype: int64

Some of the continent have a very low representation in the data. To rectify this, we can take a random sample of 130 observations from each of the 5 continents. In other words, we can take a random sample from each of the continent-based groups.

evenly_sampled_lifeExp_data = gdp_lifeExp_data.groupby('continent').apply(lambda x:x.sample(130, replace=True), include_groups=False)

group_sizes = evenly_sampled_lifeExp_data.groupby(level=0).size()
print(group_sizes)
continent
Africa      130
Americas    130
Asia        130
Europe      130
Oceania     130
dtype: int64

The above stratified random sample equally represents all the continent.

15.11 corr(): Correlation by group

The corr() method of the GroupBy object returns the correlation between all pairs of columns within each group.

Example: Find the correlation between lifeExp and gdpPercap for each continent-country level combination.

gdp_lifeExp_data.groupby(['continent','country']).apply(lambda x:x['lifeExp'].corr(x['gdpPercap']), include_groups=False)
continent  country       
Africa     Algeria           0.904471
           Angola           -0.301079
           Benin             0.843949
           Botswana          0.005597
           Burkina Faso      0.881677
                               ...   
Europe     Switzerland       0.980715
           Turkey            0.954455
           United Kingdom    0.989893
Oceania    Australia         0.986446
           New Zealand       0.974493
Length: 142, dtype: float64

Life expectancy is closely associated with GDP per capita across most continent-country combinations.

15.12 Independent Study

15.12.1 Practice exercise 1

Read the spotify dataset from spotify_data.csv that contains information about tracks and artists

15.12.1.1

Find the mean and standard deviation of the track popularity for each genre.

15.12.1.2

Create a new categorical column, energy_lvl, with two levels – Low energy and High energy – using equal-sized bins based on the track’s energy level. Then, calculate the mean, standard deviation, and 90th percentile of track popularity for each genre and energy level combination

15.12.1.3

Find the mean and standard deviation of track popularity and danceability for each genre and energy level. What insights you can gain from the generated table

15.12.1.4

For each genre and energy level, find the mean and standard deviation of the track popularity, and the minimum and maximum values of loudness.

15.12.1.5

Find the most popular artist from each genre.

15.12.1.6

Filter the first 4 columns of the spotify dataset. Drop duplicate observartions in the resulting dataset using the Pandas DataFrame method drop_duplicates(). Find the top 3 most popular artists for each genre.

15.12.1.7

The spotify dataset has more than 200k observations. It may be expensive to operate with so many observations. Take a random sample of 650 observations to analyze spotify data, such that all genres are equally represented.

15.12.1.8

Find the correlation between danceability and track popularity for each genre-energy level combination.

15.12.1.9

Find the number of observations in each group, where each groups corresponds to a distinct genre-energy lvl combination

15.12.1.10

Find the percentage of observations in each group of the above table.

15.12.1.11

What percentage of unique tracks are contributed by the top 5 artists of each genre?

Hint: Find the top 5 artists based on artist_popularity for each genre. Count the total number of unique tracks (track_name) contributed by these artists. Divide this number by the total number of unique tracks in the data. The nunique() function will be useful.